Queer Coded: Supervised ML to Predict LGBTQ+ Acceptance in American Neighborhoods

STAT 301-3 Data Science 3 with R

Author

Vlad Nevirkovets, Sean Pascoe, Dylan Yan

Published

June 8, 2023

Introduction

Neighborhoods with a high queer population, often called Gayborhoods, are urban areas in which LGBTQ+ people have found community and built culture worldwide. These areas, however, represent far more than the sexuality of their residents: they have been cited as supporting robust creative economies and offering a welcoming environment for people of many identities. The identity of a locale as a Gayborhood is therefore a crucial sociological metric, with more visibly queer neighborhoods driving social liberalism in the face of prejudice. Our analysis focuses on building a location-based regression model that uses a variety of parameters, including housing, land use, and non-queer demographic data, to predict Gayborhood degree and thereby gauge whether an area is welcoming to queer people, with the goal of advancing understanding of socially liberal areas.

Outcome Variable

To rank Gayborhood status, we use a published dataset from data.world¹ that was analyzed to display the distribution of queer communities in Jan Diehm’s 2018 article Men are from Chelsea, Women are from Park Slope. The metrics in this dataset are combined into a Gayborhood index, which we use as the supervising variable for our machine learning project. This index is a holistic assessment of queerness, derived primarily from the following measures:

  • Same-sex joint tax filers
  • Same-sex unmarried-partner households
  • Number of gay bars
  • Whether a pride parade routes through the region

We will train a regression model to predict the Gayborhood index. The raw variable lies on a 0-to-100 scale with a mean of 2.39, a median of 1.27, and a maximum of 67.1. In the Exploratory Data Analysis section, we discuss measures to transform the outcome variable into a more favorable distribution.

Data collection

Because this project is based primarily on data by ZCTA (ZIP Code Tabulation Area, a US Census-defined zipcode proxy), we are able to pool data from two main sources, the US Census and Open ICPSR, to encompass four major facets of urban life.

  1. Our first predictor set² comes from the US Census and provides a variety of information about housing characteristics, including the number of housing units in an area, rent prices, and indicators of development (phone reception, income-to-rent ratio, etc.). It is derived from the Census’s ACS 5-year survey estimates for 2015, matching the year the Gayborhood dataset was published.
  2. We also used US Census data to glean information about various demographic characteristics, represented in our second data section³. This set primarily synthesizes demographic information that is not specifically related to sexuality (i.e., the parameters that went into calculating our outcome variable) but describes other characteristics of each area.
  3. We also included data from Open ICPSR on parks⁴ in each geographical region, including the number of open parks and the proportion of ZCTA area that is park space. Since parks are public areas, they serve as a predictive metric for the culture of a community and help describe neighborhood lifestyle.
  4. Finally, we included data on the primary method of transport to work⁵, also from the US Census ACS survey. This rounds out our predictors by describing movement within a neighborhood, as well as the physical locale.

These yield a functional dataset with \(> 550\) predictors, which fall into the above categories. A skim of the first 10 variables is provided below:

Data summary
Name Piped data
Number of rows 2145
Number of columns 12
_______________________
Column type frequency:
factor 1
numeric 11
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
zip_code 0 1 FALSE 2145 017: 1, 017: 1, 017: 1, 017: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
gayborhood_index 0 1 1.09 0.50 0 0.77 1.08 1.41 2.6
estimate_sex_and_age_total_population 0 1 29847.58 19524.92 304 15415.00 26827.00 39609.00 114982.0
estimate_sex_and_age_total_population_male 0 1 14550.75 9507.60 194 7460.00 13001.00 19268.00 60595.0
estimate_sex_and_age_total_population_female 0 1 15296.83 10078.78 110 7847.00 13772.00 20323.00 58520.0
estimate_sex_and_age_under_5_years 0 1 1924.05 1532.47 0 851.00 1584.00 2553.00 12098.0
estimate_sex_and_age_5_to_9_years 0 1 1865.10 1456.03 0 832.00 1585.00 2497.00 10215.0
estimate_sex_and_age_10_to_14_years 0 1 1841.39 1432.86 0 810.00 1544.00 2459.00 9785.0
estimate_sex_and_age_15_to_19_years 0 1 1880.63 1506.29 0 800.00 1527.00 2503.00 10857.0
estimate_sex_and_age_20_to_24_years 0 1 2105.32 1824.86 0 851.00 1667.00 2829.00 18938.0
estimate_sex_and_age_25_to_34_years 0 1 4717.06 3706.06 9 2024.00 3964.00 6408.00 26926.0
estimate_sex_and_age_35_to_44_years 0 1 4199.07 2902.13 0 2040.00 3730.00 5609.00 18739.0

Issues and Challenges

Collecting and cleaning the data was fairly straightforward. The census data is well organized and generally has low missingness, so we only had to discard a few potential predictors and can impute the rest. When cleaning the census data, we also removed columns that contained margins of error and annotations rather than the variables themselves. The parks data required removing some text embedded in the data columns and converting them to numeric format. After cleaning each source individually, we combined all datasets into a single dataset.

One main concern with our data is the ratio of observations to predictors, which is currently only about 4:1 (roughly 2,145 observations against more than 550 predictors). Because of this, most of our feature engineering relies on targeted dimensionality reduction to capture the variance in these variables without overfitting.

Because the census predictors were numerous, any variable with more than 10% missingness was removed. The missingness in the remaining variables is assumed to be random, since they were collected and reported in the US Census ACS survey.
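The 10% cutoff described above can be sketched in base R (the function name and toy data here are ours, for illustration):

```r
# Drop any column whose share of missing values exceeds the threshold
drop_high_missing <- function(df, threshold = 0.10) {
  missing_rate <- vapply(df, function(col) mean(is.na(col)), numeric(1))
  df[, missing_rate <= threshold, drop = FALSE]
}

# Toy example: "b" is 50% missing, so only "a" survives the filter
toy <- data.frame(a = 1:4, b = c(1, NA, NA, 4))
names(drop_high_missing(toy))
```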

Exploratory Data Analysis

Outcome Variable Transformation

The purpose of this regression problem is to accurately predict the outcome variable of interest, y. Based on the visualization below, a transformation should be applied to the outcome variable: it is right-skewed, which will hinder the modeling process. Additionally, the boxplot reveals many outliers in the data, which would lead to poor fits for certain models. We therefore seek a transformation such that the transformed y appears to have been generated from a normal distribution.

We originally planned to apply a log10 transformation to our outcome variable, an index measuring how queer a ZCTA region is based on parameters defined in this dataset from data.world⁶. However, the apparent success of the log transformation depended on removing 0 values, for which the log is undefined (mapped to negative infinity in R). Examining the original data more closely, we found the following:

  • Calculations for totindex (the index measure) are not easily decipherable and, for a few ZCTAs, appear to be entirely wrong
  • 0 values occur even when some of the key queerness measures (e.g., the number of same-sex couples filing taxes jointly) are positive
  • Some zipcodes included in the data have no documented inhabitants (such as 20535 and 94128)

To remedy these issues, we first dropped any zipcode with \(<100\) inhabitants, then recalculated our outcome variable using the following formula:

\[gayborhood\_index = 40*ss\_tax\_index + 40*ss\_live\_index + 10*parade + 10*bars\] where

  • \(ss\_tax\_index\) is a measure of what percent of couples filing jointly are same sex couples (normalized by the maximum value, on a 0 to 1 scale)
  • \(ss\_live\_index\) is a measure of what percent of unmarried couples living together are same sex couples (normalized by the maximum value, on a 0 to 1 scale)
  • \(parade\) is an indicator variable indicating if a pride parade passes through that zipcode (0 or 1)
  • \(bars\) is a measure of how many gay bars there are in that area (normalized by the maximum value, 20, on a 0 to 1 scale)
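The recalculation can be sketched directly in base R. The function and argument names below are ours; the normalizing maxima for the tax and household indices come from the dataset, while bars are normalized by the maximum of 20 as noted above:

```r
# Gayborhood index for a single ZCTA, per the formula above. Each component
# is pre-normalized to [0, 1] before weighting (40/40/10/10).
gayborhood_index <- function(ss_tax_pct, ss_live_pct, parade, n_bars,
                             max_tax_pct, max_live_pct, max_bars = 20) {
  40 * (ss_tax_pct / max_tax_pct) +
  40 * (ss_live_pct / max_live_pct) +
  10 * parade +
  10 * pmin(n_bars, max_bars) / max_bars
}

# A hypothetical ZCTA at the dataset maxima, with a parade and 20 bars,
# scores the full 100; one with no measured queer presence scores 0
gayborhood_index(5, 8, 1, 20, max_tax_pct = 5, max_live_pct = 8)  # 100
gayborhood_index(0, 0, 0, 0, max_tax_pct = 5, max_live_pct = 8)   # 0
```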

This recalculation repurposes the original dataset for our use, which, unlike the original article Men are from Chelsea, Women are from Park Slope, is not concerned with splitting up men and women.

We then used a Yeo-Johnson transformation with \(\lambda = -0.268\) (calculated by the applicable step_YeoJohnson()) to transform our data and improve normality. This transformation was chosen because it lets us keep our 0 values, which have an important physical meaning in this dataset, without greatly compromising normality:
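For nonnegative values like ours, the Yeo-Johnson transform with \(\lambda \neq 0\) reduces to \(((y+1)^\lambda - 1)/\lambda\), which maps 0 exactly to 0. A quick base-R check with the fitted \(\lambda\):

```r
# Yeo-Johnson transform for nonnegative y, using the lambda fitted by
# step_YeoJohnson(); zeros stay at zero, and the heavy right tail shrinks
yj <- function(y, lambda = -0.268) ((y + 1)^lambda - 1) / lambda

yj(0)     # exactly 0: zero-index ZCTAs keep their meaning
yj(67.1)  # the original maximum compresses to roughly 2.5
```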

Although the transformed outcome variable still does not appear to have been generated from a normal distribution, there are no longer outliers, which will improve model performance.

Predictor Variable Selection

The provided data listed over 500 possible predictor variables. Fitting models to all of them would not only be computationally expensive but would also increase the variance of the predictions. Because of these issues, this report mechanically selects the most relevant predictors for the regression problem.

A lasso model was fit to the training set to narrow down the list of predictors. A lenient penalty of 0.01 was employed, along with recipe steps to impute missing values with K-nearest neighbors and to remove variables with near-zero variance, which would have contributed little to the prediction problem. Ultimately, about 100 predictors remained for the model fitting process.
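A sketch of that selection step in tidymodels follows; gay_train, zip_code, and the other object names are illustrative stand-ins for our actual objects:

```r
library(tidymodels)

# Lenient lasso (mixture = 1, penalty = 0.01): predictors whose coefficients
# are shrunk to zero are dropped before the main modeling stage
lasso_rec <- recipe(gayborhood_index ~ ., data = gay_train) %>%
  update_role(zip_code, new_role = "id") %>%
  step_impute_knn(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

lasso_fit <- workflow() %>%
  add_recipe(lasso_rec) %>%
  add_model(linear_reg(penalty = 0.01, mixture = 1) %>% set_engine("glmnet")) %>%
  fit(data = gay_train)

# Keep only the terms with nonzero coefficients
kept_predictors <- extract_fit_parsnip(lasso_fit) %>%
  tidy() %>%
  filter(estimate != 0, term != "(Intercept)") %>%
  pull(term)
```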

Feature Engineering

In addition to the large number of possible predictor variables, the predictors have varying degrees of missingness. Many of the variables with larger degrees of missingness were removed during lasso selection, and the remaining variables have low missingness (under 2%). The remaining missing values are summarized in the table below, with higher missingness listed first.

The remaining missingness was likewise resolved by imputing values with a K-nearest neighbors model.

Our recipe modeled the Gayborhood index as a function of all predictors remaining after lasso selection. Additional feature engineering steps applied to all models were: removing near-zero-variance predictors (likely redundant after lasso selection), updating the role of the ID column so it is not used as a predictor, dummy-encoding categorical predictors, and normalizing all predictors. The same recipe was used for every model type except random forest, where dummy encoding was omitted; tree-based models accept categorical variables directly and are often faster without dummy encoding.
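The shared recipe can be sketched as follows (gay_train and zip_code are placeholders for our actual objects):

```r
library(tidymodels)

# Shared recipe for all model types; the random forest variant simply
# drops step_dummy(), since the ranger engine handles factors natively
base_rec <- recipe(gayborhood_index ~ ., data = gay_train) %>%
  update_role(zip_code, new_role = "id") %>%
  step_impute_knn(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
```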

We also fit an elastic net model with partial least squares (PLS) feature extraction (see more below). The recipe for this model added interaction terms between all predictors and a PLS step.

Data Splitting and Folding

We split the data into training and testing sets with proportions of 0.8 and 0.2, stratified by the outcome variable. We chose this split because it is conventional for datasets large enough that withholding data from the training set is not a concern.

We used v-fold cross-validation to guard against overfitting to the training data, with 5 folds and 3 repeats rather than larger numbers to keep model runtimes manageable.
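In code, the split and folds look roughly like this (the seed and object names are ours):

```r
library(tidymodels)

set.seed(301)  # illustrative seed
# 80/20 split stratified on the outcome
gay_split <- initial_split(gay_data, prop = 0.8, strata = gayborhood_index)
gay_train <- training(gay_split)
gay_test  <- testing(gay_split)

# 5 folds x 3 repeats, again stratified on the outcome
gay_folds <- vfold_cv(gay_train, v = 5, repeats = 3, strata = gayborhood_index)
```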

Modeling Process

Overview

Our modeling process began with importing the data, conducting exploratory data analysis, and combining the datasets. We then conducted lasso variable selection, arriving at a final dataset with a smaller number of more useful predictors, and performed relatively basic feature engineering. Next, we tuned and fit a number of different model types (see next section) and assessed the RMSE of each using the cross-validation folds. The three best-performing models were incorporated into an ensemble model, which was finally evaluated on the testing set to produce the final results.

Here is a flowchart outlining our modeling process:

Model Types

We fit the following model types:

  • Penalized Elastic Net Regression
  • Random Forest
    • ranger engine
  • Boosted Tree
    • xgboost engine
  • K-Nearest Neighbors
    • kknn engine
  • Neural Networks
    • keras engine (TensorFlow, called from Python through R)
  • Multivariate Adaptive Regression Splines (MARS)
    • earth engine
  • Cubist ensemble regression
    • cubist engine

Model Results

We found that the Cubist ensemble performed best, with a mean RMSE of 0.328. Elastic net regression (without PLS) and random forest were also in the top three models, all with RMSEs under 0.34. K-nearest neighbors and boosted trees performed almost as well, both under 0.35. There is then a significant gap to the worst-performing models: elastic net with PLS, multivariate adaptive regression splines, and the keras neural network.

Initial Model Comparison
Model Type Best Mean RMSE Standard Error Optimal Hyperparameters
Cubist Ensemble Regression (BI) 0.328 0.004 Committees = 88, Max Rules = 290, Neighbors = 0
Elastic Net (BI) 0.337 0.004 Penalty = 1.1e-08, Mixture = 0.051
Random Forest (BI) 0.339 0.003 Mtry = 91, Trees = 1900, Min N = 3
K-Nearest Neighbors (BI) 0.343 0.003 Neighbors = 22, Dist Power = 1
Boosted Trees (BI) 0.348 0.003 Min N = 38, Learn Rate = 0.25, Loss Reduction = 1.2e-09
Elastic Net (PLS) (BI) 0.370 0.003 Penalty = 0.0032, Mixture = 1
Multivariate Adaptive Regression Splines (BI) 0.376 0.003 Num Terms = 5, Prod Degree = 1
Neural Network (Keras) 0.398 0.004 Penalty = 1, Hidden Units = 5
Null Model 0.503 0.002 NA

Ensemble Model

Our three best model types (cubist ensemble regression, elastic net regression, and random forest) were combined into an ensemble model with the stacks package. The final ensemble kept 6 configurations of the cubist ensemble, 2 configurations of the elastic net, and 2 configurations of the random forest. The individual model performances are tabulated below:

Ensemble Results
Model Name RMSE
cer_bayes_1_23 3.196
cer_bayes_1_24 3.216
ensemble 3.304
cer_bayesIter9 3.318
cer_bayes_1_09 3.376
cer_bayes_1_06 3.378
cer_bayes_1_05 3.414
en_bayes_1_1 3.466
en_bayesIter3 3.466
rf_bayesIter1 3.673
rf_bayes_1_3 4.326
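The stacking step itself can be sketched with the stacks package; the candidate tuning-result names (cubist_res, en_res, rf_res) are our placeholders:

```r
library(stacks)

# blend_predictions() fits a lasso on out-of-fold predictions to weight the
# candidates (zeroing out weak ones); fit_members() then refits the
# surviving configurations on the full training set
gay_stack <- stacks() %>%
  add_candidates(cubist_res) %>%
  add_candidates(en_res) %>%
  add_candidates(rf_res) %>%
  blend_predictions() %>%
  fit_members()
```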

Final Results

The RMSE of the final ensemble model was 3.30. Interestingly, this is higher than that of two of the ensemble members (both cubist ensemble models). The concordance correlation coefficient (CCC) of the final ensemble was 0.714.
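For reference, the two reported metrics can be written out in base R (a sketch of the underlying math; we actually computed them with yardstick):

```r
# Root mean squared error and Lin's concordance correlation coefficient
rmse <- function(truth, est) sqrt(mean((truth - est)^2))
ccc  <- function(truth, est) {
  2 * cov(truth, est) /
    (var(truth) + var(est) + (mean(truth) - mean(est))^2)
}

truth <- c(0.2, 0.9, 1.4, 2.1)
rmse(truth, truth + 1)  # a constant offset of 1 gives RMSE 1
ccc(truth, truth + 1)   # ...and drags CCC below 1, unlike Pearson correlation
```

Unlike plain correlation, CCC penalizes both scatter and systematic miscalibration, which is why it complements RMSE here.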

Here is a plot of predicted vs true values of the Gayborhood index:


Future Directions

Future research could test additional recipe steps, such as spline expansions (e.g., natural splines via step_ns() and B-splines via step_bs()) and principal component analysis (step_pca()).

We might also experiment with bagged ensembles via the baguette package, which could strengthen the tree-based learners.

Footnotes

  1. Jan Diehm. (2018). Gayborhoods [Data set]. The Pudding, data.world. https://data.world/the-pudding/gayborhoods. Accessed 10 April 2023.↩︎

  2. US Census. (2021). DP04: SELECTED HOUSING CHARACTERISTICS [Data set]. data.census.gov. https://data.census.gov/table?q=rent&g=010XX00US$8600000&tid=ACSDP5Y2021.DP04. Accessed 24 April 2023.↩︎

  3. US Census. (2015). DP05: ACS DEMOGRAPHIC AND HOUSING ESTIMATES [Data set]. data.census.gov. https://data.census.gov/table?g=010XX00US$8600000&tid=ACSDP5Y2015.DP05. Accessed 26 April 2023.↩︎

  4. Li, Mao, Melendez, Robert, Khan, Anam, Gomez-Lopez, Iris, Clarke, Philippa, and Chenoweth, Megan. (2020). National Neighborhood Data Archive (NaNDA): Parks by ZIP Code Tabulation Area, United States, 2018. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. https://doi.org/10.3886/E119803V1. Accessed 23 April 2023.↩︎

  5. US Census. (2015). S0802: MEANS OF TRANSPORTATION TO WORK BY SELECTED CHARACTERISTICS [Data set]. data.census.gov. https://data.census.gov/table?q=commuting&g=010XX00US$8600000&tid=ACSST5Y2021.S0802. Accessed 30 April 2023.↩︎

  6. Jan Diehm. (2018). Gayborhoods [Data set]. The Pudding, data.world. https://data.world/the-pudding/gayborhoods. Accessed 10 April 2023.↩︎